Abstract

Human language is a combination of ele- 
mental languages/domains/styles that change 
across and sometimes within discourses. Lan- 
guage models, which play a crucial role in 
speech recognizers and machine translation 
systems, are particularly sensitive to such 
changes, unless some form of adaptation takes 
place. One approach to speech language 
model adaptation is self-training, in which a 
language model's parameters are tuned based 
on automatically transcribed audio. However, 
transcription errors can misguide self-training, 
particularly in challenging settings such as 
conversational speech. In this work, we pro- 
pose a model that considers the confusions (er- 
rors) of the ASR channel. By modeling the 
likely confusions in the ASR output instead of 
using just the 1-best, we improve self-training 
efficacy by obtaining a more reliable reference 
transcription estimate. We demonstrate im- 
proved topic-based language modeling adap- 
tation results over both 1-best and lattice self- 
training using our ASR channel confusion es- 
timates on telephone conversations. 



1 Introduction 

Modern statistical automatic speech recognition 
(ASR) systems rely on language models for ranking 
hypotheses generated from acoustic data. Language 
models are trained on millions of words (or more) 
taken from text that matches the spoken language, 
domain and style of interest. Reliance on (static) training data makes language models brittle (Bellegarda, 2001) to changes in domain. However, for many problems of interest, there are numerous hours of spoken audio but little to no written text for language model training. In these settings, we must rely on language model adaptation using the spoken audio to improve performance on data of interest. One common approach to language model adaptation is self-training (Novotney et al., 2009), in which the language model is retrained on the output from the ASR system run on the new audio. Unfortunately, self-training learns from both correct and error-ridden transcriptions that mislead the system, a particular problem for high word error rate (WER) domains, such as conversational speech. Even efforts to consider the entire ASR lattice in self-training cannot account for the ASR error bias. Worse still, this is particularly a problem for rare content words as compared to common function words; the fact that content words are more important for understandability exacerbates the problem.

Confusions in the ASR output pose problems for other applications, such as speech topic classification and spoken document retrieval. In high-WER scenarios, their performance degrades, sometimes considerably (Hazen and Richardson, 2008).

In this work, we consider the problem of topic adaptation of a speech recognizer (Seymore and Rosenfeld, 1997), in which we adapt the language model to the topic of the new speech. Our novelty lies in the fact that we correct for the biases present in the output by estimating ASR confusions. Topic proportions are estimated via a probabilistic graphical model which accounts for confusions in the transcript and provides a more accurate portrayal of the spoken audio. To demonstrate the utility of our model, we show improved results over traditional self-training as well as lattice-based self-training for the challenging setting of conversational speech transcription. In particular, we show statistically significant improvements for content words.

Note that Bacchiani et al. (2004) also consider the
problem of language model adaptation as an error 
correction problem, but with supervised methods. 
They train an error-correcting perceptron model on 
reference transcriptions from the new domain. In 
contrast, our approach does not assume the existence 
of transcribed data for training a confusion model; 
rather, the model is trained in an unsupervised man- 
ner based only on the ASR output. 

The paper proceeds as follows: Section 2 describes our setting of language model adaptation and our topic-based language model. Section 3 presents a language model adaptation process based on maximum-likelihood and maximum-aposteriori self-training, while Section 4 introduces adaptation that utilizes ASR channel estimates. Section 5 describes experiments on conversational speech.

2 Language Model Adaptation 

We are given a trained speech recognizer, topic- 
based language model and a large collection of audio 
utterances (many conversations) for a new domain, 
i.e. a change of topics, but not manually transcribed 
text needed for language model training. Our goal is 
to adapt the language model by learning new topic 
distributions using the available audio. We consider 
self-training that adapts the topic distributions based 
on automatically transcribed audio. 

A conversation C is composed of N speech utterances, which are represented by N lattices (confusion networks), annotated with words and posterior probabilities, produced by the speech recognizer.¹ Each confusion network consists of a sequence of bins, where each bin is a set of words hypothesized by the recognizer at a particular time. The i-th bin is denoted by $b_i$ and contains words $\{w_{i,j}\}_{j=1}^{|V|}$, where $V$ is the vocabulary.² When obvious from context, we omit subscript j. The most likely word in bin $b_i$ is denoted by $\hat{w}_i$. Let M be the total number of bins in the confusion networks in a single conversation.

1 We assume conversations but our methods can be applied to non-conversational genres.

2 Under specific contexts, not all vocabulary items are likely to be truly spoken in a bin. Although we use summations over all $w \in V$, we practically use only words which are acoustically confusable and consistent with the lexical context of the bin.

We use unigram topic-based language models (multinomial distributions over $V$), which capture word frequencies under each topic. Such models have been used in a variety of ways, such as in PLSA (Hofmann, 2001) and LDA (Blei et al., 2003), and under different training scenarios. Topic-based models provide a global context beyond the local word history in modeling a word's probability, and have been found especially useful in language model adaptation (Tam and Schultz, 2005; Wang and Stolcke, 2007; Hsu and Glass, 2006). Each topic $t \in \{1, \ldots, T\}$ has a multinomial distribution denoted by $q(w|t)$, $w \in V$. These topic-word distributions are learned from conversations labeled with topics, such as those in the Fisher speech corpus.³

Adapting the topic-based language model means learning a set of conversation specific mixture weights $\lambda^C = (\lambda^C_1, \ldots, \lambda^C_T)$, where $\lambda^C_t$ indicates the likelihood of seeing topic t in conversation C.⁴ While topic compositions remain fixed, the topics selected change with each conversation.⁵ These mixture weights form the true distribution of a word:

$$q(w) = \sum_{t=1}^{T} \lambda_t \, q(w|t) \tag{1}$$
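As a small illustration of equation (1), the following Python sketch evaluates the mixture probability of a word. The two toy topics, their probabilities, and the function name `mixture_prob` are hypothetical and chosen only for this example; they are not taken from the experiments.

```python
def mixture_prob(word, weights, topic_word):
    """q(w) = sum_t lambda_t * q(w|t), as in equation (1)."""
    return sum(lam * dist.get(word, 0.0)
               for lam, dist in zip(weights, topic_word))


# Hypothetical toy topic-word distributions q(w|t); each sums to one.
topic_word = [
    {"game": 0.5, "team": 0.4, "the": 0.1},  # a "sports"-like topic
    {"vote": 0.6, "law": 0.3, "the": 0.1},   # a "politics"-like topic
]
weights = [0.75, 0.25]  # conversation-specific mixture weights lambda

print(mixture_prob("game", weights, topic_word))  # 0.375
```

Because each $q(\cdot|t)$ sums to one and the weights sum to one, the mixture is itself a distribution over the vocabulary.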



3 Learning $\lambda$ from ASR Output



We begin by following previous approaches to self- 
training, in which model parameters are re-estimated 
based on ASR output. We consider self-training 
based on 1-best and lattice maximum-likelihood es- 
timation (MLE) as well as maximum-aposteriori 
(MAP) training. In the next section, we modify these 
approaches to incorporate our confusion estimates. 

For estimating the topic mixtures $\lambda$ using self-training on the 1-best ASR output, i.e. the 1-best path in the confusion network $\hat{w} = \{\hat{w}_i\}_{i=1}^{M}$, we write the log-likelihood of the observations $\hat{w}$:

$$\mathcal{L}^{(1)} = \sum_{i=1}^{M} \log q(\hat{w}_i), \tag{2}$$



3 Alternative approaches estimate topics from unlabeled data, but we use labeled data for evaluation purposes.
4 Unless required we drop the superscript C.
5 This assumption can be relaxed to learn $q(w|t)$ as well.



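where $q$ is a mixture of topic models (1), for mixture weights $\lambda$. We expect topic-based distributions will better estimate the true word distribution than the empirical estimate $\hat{q}(w) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}(\hat{w}_i = w)$, as the latter is biased due to ASR errors. Maximum-likelihood estimation of $\lambda$ in (2) is given by:

$$\lambda^* = \arg\max_{\lambda} \sum_{i=1}^{M} \log q(\hat{w}_i) = \arg\max_{\lambda} \sum_{i=1}^{M} \log\Big(\sum_{t} \lambda_t \, q(\hat{w}_i|t)\Big) \tag{3}$$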

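Using the EM algorithm, Q is the expected log-likelihood of the "complete" data:

$$Q(\lambda^{(j)}, \lambda) = \sum_{i=1}^{M} \sum_{t} r^{(j)}(t|\hat{w}_i) \log\big(\lambda_t \, q(\hat{w}_i|t)\big), \tag{4}$$

where $r^{(j)}(t|\hat{w}_i)$ is the posterior distribution of the topic in the i-th bin, given the 1-best word $\hat{w}_i$; this is computed in the E-step of the j-th iteration:

$$r^{(j)}(t|\hat{w}_i) = \frac{\lambda_t^{(j)} \, q(\hat{w}_i|t)}{\sum_{t'} \lambda_{t'}^{(j)} \, q(\hat{w}_i|t')} \tag{5}$$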



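In the M-step of the j + 1 iteration, the new estimate of the prior is computed by maximizing Q, i.e.,

$$\lambda_t^{(j+1)} = \frac{\sum_{i=1}^{M} r^{(j)}(t|\hat{w}_i)}{\sum_{t'} \sum_{i=1}^{M} r^{(j)}(t'|\hat{w}_i)} \tag{6}$$

3.1 Learning $\lambda$ from Expected Counts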



Following Novotney et al. (2009) we next consider using the entire bin in self-training by maximizing the expected log-likelihood of the ASR output:

$$\mathcal{L}'^{(1)} = E\Big[\sum_{i=1}^{M} \log q(W_i)\Big] \tag{7}$$

where $W_i$ is a random variable which takes the value $w \in b_i$ with probability equal to the confidence (posterior probability) $s_i(w)$ of the recognizer. The maximum-likelihood estimation problem becomes:

$$\lambda^* = \arg\max_{\lambda} \mathcal{L}'^{(1)} = \arg\max_{\lambda} \sum_{i=1}^{M} \sum_{w \in b_i} s_i(w) \log\Big(\sum_{t} \lambda_t \, q(w|t)\Big) = \arg\max_{\lambda} \sum_{w \in V} tf(w) \log\Big(\sum_{t} \lambda_t \, q(w|t)\Big) \tag{8}$$

where $tf(w) = \sum_i s_i(w)$ denotes the expected count of word w in the conversation, given by the sum of the posteriors of w in all the confusion network bins of the conversation (Karakos et al., 2011) (note that for text documents, it is equivalent to term-frequency). We again use the EM algorithm, with objective function:

$$Q(\lambda^{(j)}, \lambda) = \sum_{w \in V} tf(w) \sum_{t} r^{(j)}(t|w) \log\big(\lambda_t \, q(w|t)\big), \tag{9}$$

where $r^{(j)}(t|w)$ is the posterior distribution of the topic given word w computed using (5) (but with $w$ instead of $\hat{w}_i$). In the M-step of the j + 1 iteration, the new estimate of the prior is computed by maximizing (9), i.e.,

$$\lambda_t^{(j+1)} = \frac{\sum_{w \in V} tf(w) \, r^{(j)}(t|w)}{\sum_{t'} \sum_{w \in V} tf(w) \, r^{(j)}(t'|w)} \tag{10}$$
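The E-step (5) and the M-step (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: bins are represented as dictionaries mapping words to posteriors, the weights start from a uniform distribution, and a small `floor` probability (a hypothetical stand-in for proper smoothing) is used for words a topic has not seen. The names `expected_counts` and `em_mixture_weights` are introduced here for illustration.

```python
def expected_counts(bins):
    """tf(w): sum of the posteriors s_i(w) over all confusion-network bins."""
    tf = {}
    for b in bins:  # each bin maps word -> posterior s_i(w)
        for w, s in b.items():
            tf[w] = tf.get(w, 0.0) + s
    return tf


def em_mixture_weights(tf, topic_word, n_iter=50, floor=1e-12):
    """Alternate the E-step (5) and the M-step (10) for the topic weights."""
    T = len(topic_word)
    lam = [1.0 / T] * T  # uniform initialization (an assumption)
    for _ in range(n_iter):
        new = [0.0] * T
        for w, count in tf.items():
            # E-step: posterior r(t|w) proportional to lambda_t * q(w|t)
            joint = [lam[t] * topic_word[t].get(w, floor) for t in range(T)]
            z = sum(joint)
            for t in range(T):
                new[t] += count * joint[t] / z  # accumulate tf(w) * r(t|w)
        total = sum(new)
        lam = [x / total for x in new]  # M-step: normalize over topics
    return lam
```

Self-training on the 1-best output, as in (6), is the special case in which every bin contains only its top word with posterior 1, so that `tf` reduces to plain word counts.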

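3.2 Maximum-Aposteriori Estimation of $\lambda$

In addition to a maximum likelihood estimate, we consider a maximum-aposteriori (MAP) estimate by placing a Dirichlet prior over $\lambda$ (Bacchiani and Roark, 2003):

$$\lambda^* = \arg\max_{\lambda} \; \mathcal{L} + \beta \log(\mathrm{Dir}(\lambda; \alpha)) \tag{11}$$

where $\mathrm{Dir}(\lambda; \alpha)$ is the pdf of the Dirichlet distribution with parameter $\alpha$: $\mathrm{Dir}(\lambda; \alpha) = \frac{1}{B(\alpha)} \prod_t \lambda_t^{\alpha - 1}$.

This introduces an extra component in the optimization. It is easy to prove that the update equation for $\lambda_t$ becomes:

$$\lambda_t^{(j+1)} = \frac{\big[\sum_{i=1}^{M} r^{(j)}(t|\hat{w}_i) + \beta(\alpha - 1)\big]_+}{\sum_{t'} \sum_{i=1}^{M} r^{(j)}(t'|\hat{w}_i) + T\beta(\alpha - 1)} \tag{12}$$

for the case where only the ASR 1-best is used, and

$$\lambda_t^{(j+1)} = \frac{\big[\sum_{w \in V} tf(w) \, r^{(j)}(t|w) + \beta(\alpha - 1)\big]_+}{\sum_{t'} \sum_{w \in V} tf(w) \, r^{(j)}(t'|w) + T\beta(\alpha - 1)} \tag{13}$$

for the case where expected counts are used. Note that the $[x]_+$ notation stands for $\max\{0, x\}$.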
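The MAP M-step differs from the maximum-likelihood one only by the additive term $\beta(\alpha - 1)$ and the clipping at zero. The sketch below assumes the per-topic responsibility mass (the sum of $tf(w)\,r(t|w)$ over words) has already been accumulated in an E-step; the function name `map_m_step` is hypothetical. One assumption to note: the sketch renormalizes over the clipped values so that the weights always sum to one, which coincides with the denominator of (13) whenever no topic is clipped.

```python
def map_m_step(resp_mass, alpha, beta):
    """MAP re-estimate of the topic weights.

    resp_mass[t] holds sum_w tf(w) * r(t|w) from the E-step.
    """
    shift = beta * (alpha - 1.0)
    # [x]_+ = max(0, x): a topic whose mass falls below zero is switched off.
    clipped = [max(0.0, m + shift) for m in resp_mass]
    total = sum(clipped)  # assumes at least one topic survives clipping
    return [c / total for c in clipped]
```

With $\alpha < 1$ the shift is negative, so topics with little support are pushed to exactly zero, which encourages sparse topic proportions; with $\alpha = 1$ the update reduces to the maximum-likelihood M-step.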

4 Confusion Estimation in ASR Output 

Self-training on ASR output can mislead the lan- 
guage model through confusions in the ASR chan- 
nel. By modeling these confusions we can guide 
self-training and recover from recognition errors. 



Figure 1: The graphical model representations for maximum-likelihood 1-best self-training (left), self-training with the ASR channel confusion model (middle), MAP training with the confusion model (right). In all models, each word is conditionally independent given the selected topic, and topic proportions $\lambda$ are conversation specific. For the middle and right models, each observed top word in the i-th confusion network bin ($\hat{w}_i$) is generated by first sampling a topic $t_i$ according to $\lambda$, then sampling the (true) word $w_i$ according to the topic $t_i$, and finally sampling the observed ASR word $\hat{w}_i$ based on the ASR channel confusion model ($p_c(v|w)$). For expected count models, $\hat{w}_i$ becomes a distribution over words in the confusion network bin $b_i$.



The ASR channel confusion model is represented by a conditional probability distribution $p_c(v|w)$, $v, w \in V$, which denotes the probability that the most likely word in the output of the recognizer (i.e., the "1-best" word) is v, given that the true (reference) word spoken is w. Of course, this conditional distribution is just an approximation, as many other phenomena (coarticulation, non-stationarity of speech, channel variations, lexical choice in the context, etc.) cause this probability to vary. We assume that $p_c(v|w)$ is an "average" of the conditional probabilities under various conditions.

We use the following simple procedure for estimating the ASR channel, similar to that of (Xu et al., 2009) for computing cohort sets:

• Create confusion networks (Mangu et al., 1999) with the available audio.

• Count $c(w, v)$, the number of times words $w, v$ appear in the same bin.

• The conditional ASR probability is computed as $p_c(v|w) = c(v, w) / \sum_{v'} c(v', w)$.

• Prune words whose posterior probabilities are lower than 5% of the max probability in a bin.

• Keep only the top 10 words in each bin.

The last two items above were implemented as a way of reducing the search space and the complexity of the task. We did not observe significant changes in the results when we relaxed these two constraints.
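The procedure above can be sketched in Python as follows. This is an illustrative sketch under several assumptions that the text does not settle: bins are dictionaries mapping words to posteriors, pruning is applied before counting, and a word is counted as co-occurring with itself so that $p_c(w|w)$ is non-zero. The function names `prune_bin` and `estimate_channel` are hypothetical.

```python
from collections import defaultdict


def prune_bin(b, rel_threshold=0.05, top_k=10):
    """Drop words below 5% of the bin's best posterior, then keep the top 10."""
    best = max(b.values())
    kept = {w: s for w, s in b.items() if s >= rel_threshold * best}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])


def estimate_channel(bins):
    """Estimate p_c(v|w) from co-occurrence counts c(v, w) within bins."""
    counts = defaultdict(lambda: defaultdict(float))  # counts[w][v] = c(v, w)
    for b in bins:
        words = list(prune_bin(b))
        for w in words:
            for v in words:  # includes v == w (an assumption, see lead-in)
                counts[w][v] += 1.0
    p_c = {}
    for w, row in counts.items():
        z = sum(row.values())
        p_c[w] = {v: c / z for v, c in row.items()}  # normalize over v
    return p_c
```

The result is a nested dictionary indexed as `p_c[w][v]`, so each row `p_c[w]` is a distribution over the words that were observed to compete with `w`.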



4.1 Learning $\lambda$ with ASR Confusion Estimates

We now derive a maximum-likelihood estimate of $\lambda$ that is based on the 1-best ASR output but relies on the ASR channel confusion model $p_c$. The log-likelihood of the observations (most likely path in the confusion network) is:

$$\mathcal{L}^{(2)} = \sum_{i=1}^{M} \log p(\hat{w}_i), \tag{14}$$

where $p$ is the induced distribution on the observations under the confusion model $p_c(v|w)$ and the estimated distribution $q$ of (1):

$$p(\hat{w}_i) = \sum_{w \in V} q(w) \, p_c(\hat{w}_i|w). \tag{15}$$

Recall that while we sum over $V$, in practice the summation is limited to only likely words. One could argue that the ASR channel confusion distribution $p_c(\hat{w}_i|w)$ should discount unlikely confusions. However, since $p_c$ is not bin specific, unlikely words in a specific bin could receive non-negligible probability from $p_c$ if they are likely in general. This makes the truncated summation over $V$ problematic.

One solution would be to reformulate $p_c(\hat{w}_i|w)$ so that it becomes a conditional distribution given the left (or even right) lexical context of the bin. But this formulation adds complexity and suffers from the usual sparsity issues. The solution we follow here imposes the constraint that only the words already existing in $b_i$ are allowed to be candidate words giving rise to $\hat{w}_i$. This uses the "pre-filtered" set of words in $b_i$ to condition on context (acoustic and lexical), without having to model such context explicitly. We anticipate that this conditioning on the words of $b_i$ leads to more accurate inference. The likelihood objective then becomes:

$$\mathcal{L}'^{(2)} = \sum_{i=1}^{M} \log p_i(\hat{w}_i), \tag{16}$$

with $p_i$ defined as:

$$p_i(\hat{w}_i) = \sum_{w \in b_i} \frac{q(w)}{\sum_{w' \in b_i} q(w')} \, p_c(\hat{w}_i|w), \tag{17}$$

i.e., the induced probability conditioned on bin $b_i$. Note that although we estimate a conversation level distribution $q(w)$, it has to be normalized in each bin by dividing by $\sum_{w \in b_i} q(w)$, in order to condition only on the words in the bin. The maximum-likelihood estimation for $\lambda$ becomes:

$$\lambda^* = \arg\max_{\lambda} \mathcal{L}'^{(2)} = \arg\max_{\lambda} \sum_{i=1}^{M} \log p_i(\hat{w}_i) = \arg\max_{\lambda} \sum_{i=1}^{M} \log\left(\sum_{w \in b_i} \frac{\sum_{t} \lambda_t \, q(w|t) \, p_c(\hat{w}_i|w)}{\sum_{w' \in b_i} \sum_{t'} \lambda_{t'} \, q(w'|t')}\right) \tag{18}$$

Note that the appearance of $\lambda_t$ in the denominator makes the maximization harder.
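To make the bin-conditioned probability (17) concrete, the following Python sketch computes it for a single bin. The word distribution `q`, the channel `p_c` (indexed as `p_c[w][v]` for $p_c(v|w)$), and all numeric values are hypothetical toy inputs, and `bin_induced_prob` is a name introduced only for this example.

```python
def bin_induced_prob(observed, bin_words, q, p_c):
    """p_i(observed): q renormalized over the bin, passed through the channel."""
    z = sum(q[w] for w in bin_words)  # bin normalizer: sum_{w in b_i} q(w)
    return sum(q[w] / z * p_c[w].get(observed, 0.0) for w in bin_words)


# Hypothetical toy inputs (not from the experiments).
q = {"cat": 0.3, "cut": 0.1, "dog": 0.6}
p_c = {
    "cat": {"cat": 0.8, "cut": 0.2},
    "cut": {"cat": 0.5, "cut": 0.5},
}
print(bin_induced_prob("cat", ["cat", "cut"], q, p_c))
```

Here "dog" carries most of the conversation-level mass in `q`, but because it is not in the bin it does not contribute; only the relative weights of "cat" and "cut" matter.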

As before, we rely on an EM-procedure to maximize this objective. Let us assume that at the j-th iteration of EM we have an estimate of the prior distribution $q$, denoted by $q^{(j)}$. This induces an observation probability in bin i based on equation (17), as well as a posterior probability $r_i^{(j)}(w|\hat{w}_i)$ that a word $w \in b_i$ is the reference word, given $\hat{w}_i$. The goal is to come up with a new estimate of the prior distribution $q'$ that increases the value of the log-likelihood. If we define:

$$Q(q^{(j)}, q') = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \log\big(p_c(\hat{w}_i|w) \, q'_i(w)\big) = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \log\left(p_c(\hat{w}_i|w) \frac{q'(w)}{\sum_{w' \in b_i} q'(w')}\right) \tag{19}$$

then we have:

$$\mathcal{L}'^{(2)}(q^{(j)}, q') - \mathcal{L}'^{(2)}(q^{(j)}, q^{(j)}) = Q(q^{(j)}, q') - Q(q^{(j)}, q^{(j)}) + \sum_{i=1}^{M} D\big(r_i^{(j)}(\cdot|\hat{w}_i) \,\|\, r'_i(\cdot|\hat{w}_i)\big) \geq Q(q^{(j)}, q') - Q(q^{(j)}, q^{(j)}) \tag{20}$$

where (20) holds because $D\big(r_i^{(j)}(\cdot|\hat{w}_i) \,\|\, r'_i(\cdot|\hat{w}_i)\big) \geq 0$ (Cover and Thomas, 1996). Thus, we just need to find the value of $q'$ that maximizes $Q(q^{(j)}, q')$, as this will guarantee that $\mathcal{L}'^{(2)}(q^{(j)}, q') \geq \mathcal{L}'^{(2)}(q^{(j)}, q^{(j)})$. The distribution $q'$ can be written:

$$q'(w) = \sum_{t} \lambda_t \, q(w|t). \tag{21}$$
Thus, Q, as a function of $\lambda$, can be written as:

$$Q(q^{(j)}, \lambda) = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \log\left(\frac{p_c(\hat{w}_i|w) \sum_{t} \lambda_t \, q(w|t)}{\sum_{w' \in b_i} \sum_{t} \lambda_t \, q(w'|t)}\right) \tag{22}$$

Equation (22) cannot be easily maximized with respect to $\lambda$ by simple differentiation, because the elements of $\lambda$ appear in the denominator, making them coupled. Instead, we will derive and maximize a lower bound for the Q-difference (20).



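For the rest of the derivation we assume that $\lambda_t = e^{\mu_t} / \sum_{t'} e^{\mu_{t'}}$, where $\mu_t \in \mathbb{R}$. We can thus express Q as a function of $\mu = (\mu_1, \ldots, \mu_T)$ as follows:

$$Q(q^{(j)}, \mu) = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \log\left(\frac{p_c(\hat{w}_i|w) \sum_{t} \frac{e^{\mu_t}}{\sum_{t'} e^{\mu_{t'}}} q(w|t)}{\sum_{w' \in b_i} \sum_{t} \frac{e^{\mu_t}}{\sum_{t'} e^{\mu_{t'}}} q(w'|t)}\right) = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \log\left(\frac{p_c(\hat{w}_i|w) \sum_{t} e^{\mu_t} q(w|t)}{\sum_{w' \in b_i} \sum_{t} e^{\mu_t} q(w'|t)}\right) \tag{23}$$

Interestingly, the fact that the sum $\sum_t e^{\mu_t}$ appears in both numerator and denominator in the above expression allows us to discard it.

At iteration j + 1, the goal is to come up with an update $\delta$, such that $\mu^{(j)} + \delta$ results in a higher value for Q; i.e., we require:

$$Q(q^{(j)}, \mu^{(j)} + \delta) \geq Q(q^{(j)}, \mu^{(j)}), \tag{24}$$

where $\mu^{(j)}$ is the weight vector that resulted after optimizing Q in the j-th iteration. Obviously,

$$q^{(j)}(w) = \sum_{t} \frac{e^{\mu_t^{(j)}}}{\sum_{t'} e^{\mu_{t'}^{(j)}}} \, q(w|t). \tag{25}$$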

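Let us consider the difference:

$$Q(q^{(j)}, \mu^{(j)} + \delta) - Q(q^{(j)}, \mu^{(j)}) \tag{26}$$

$$= \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \times \left[\log \frac{\sum_{t} e^{\mu_t^{(j)}} \sum_{w' \in b_i} q(w'|t)}{\sum_{t} e^{\mu_t^{(j)} + \delta_t} \sum_{w' \in b_i} q(w'|t)} + \log \frac{\sum_{t} e^{\mu_t^{(j)} + \delta_t} q(w|t)}{\sum_{t} e^{\mu_t^{(j)}} q(w|t)}\right] \tag{27}$$

We use the well-known inequality $\log(x) \leq x - 1 \Rightarrow -\log(x) \geq 1 - x$ and obtain:

$$\log \frac{\sum_{t} e^{\mu_t^{(j)}} \sum_{w' \in b_i} q(w'|t)}{\sum_{t} e^{\mu_t^{(j)} + \delta_t} \sum_{w' \in b_i} q(w'|t)} \geq 1 - \frac{\sum_{t} e^{\mu_t^{(j)} + \delta_t} \sum_{w' \in b_i} q(w'|t)}{\sum_{t} e^{\mu_t^{(j)}} \sum_{w' \in b_i} q(w'|t)}. \tag{28}$$

Next, we apply Jensen's inequality to obtain:

$$\log \frac{\sum_{t} e^{\mu_t^{(j)} + \delta_t} q(w|t)}{\sum_{t} e^{\mu_t^{(j)}} q(w|t)} \geq \frac{\sum_{t} e^{\mu_t^{(j)}} q(w|t) \, \delta_t}{\sum_{t} e^{\mu_t^{(j)}} q(w|t)}. \tag{29}$$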

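By combining (27), (28) and (29), we obtain a lower bound on the Q-difference of (26):

$$\sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \left[1 + \frac{\sum_{t} e^{\mu_t^{(j)}} q(w|t) \, \delta_t}{\sum_{t} e^{\mu_t^{(j)}} q(w|t)} - \frac{\sum_{t} e^{\mu_t^{(j)} + \delta_t} \sum_{w' \in b_i} q(w'|t)}{\sum_{t} e^{\mu_t^{(j)}} \sum_{w' \in b_i} q(w'|t)}\right] \tag{30}$$

It now suffices to find a $\delta$ that maximizes the lower bound of (30), as this will guarantee that the Q-difference will be greater than zero. Note that the lower bound in (30) is a concave function of $\delta_t$, and it thus has a global maximum that we will find using differentiation. Let us set $g(\delta)$ equal to the right-hand-side of (30). Then,

$$\frac{\partial g(\delta)}{\partial \delta_t} = \sum_{i=1}^{M} \sum_{w \in b_i} r_i^{(j)}(w|\hat{w}_i) \times \left[-\frac{e^{\mu_t^{(j)} + \delta_t} \sum_{w' \in b_i} q(w'|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} \sum_{w' \in b_i} q(w'|t')} + \frac{e^{\mu_t^{(j)}} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} q(w|t')}\right] \tag{31}$$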



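By setting (31) equal to 0 and solving for $\delta_t$, we obtain the update for $\delta_t$ (or, equivalently, $e^{\mu_t^{(j)} + \delta_t}$):

$$e^{\mu_t^{(j)} + \delta_t} = \left(\sum_{i=1}^{M} \sum_{w \in b_i} \frac{r_i^{(j)}(w|\hat{w}_i) \, e^{\mu_t^{(j)}} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} q(w|t')}\right) \times \left(\sum_{i=1}^{M} \frac{\sum_{w \in b_i} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} \sum_{w \in b_i} q(w|t')}\right)^{-1} \tag{32}$$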
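One iteration of the update (32) can be sketched in Python as below, together with the objective (16) so that the improvement can be checked numerically. This is a sketch under stated assumptions, not the authors' implementation: bins are lists of candidate words, `p_c[w][v]` stores $p_c(v|w)$, the scale of $\mu$ is fixed so that $e^{\mu_t} = \lambda_t$ (the scale cancels after renormalization), bins whose observed word cannot be explained by any candidate are skipped, and `FLOOR` is a hypothetical smoothing constant. All function names are introduced here for illustration.

```python
import math

FLOOR = 1e-12  # hypothetical floor for words unseen under a topic


def q_topic(word, t, topic_word):
    return topic_word[t].get(word, FLOOR)


def q_mix(word, lam, topic_word):
    return sum(l * q_topic(word, t, topic_word) for t, l in enumerate(lam))


def log_likelihood(bins, observed, lam, topic_word, p_c):
    """Objective (16): sum over bins of log p_i(observed word), p_i as in (17)."""
    total = 0.0
    for b, obs in zip(bins, observed):
        z = sum(q_mix(w, lam, topic_word) for w in b)
        p = sum(q_mix(w, lam, topic_word) / z * p_c[w].get(obs, 0.0) for w in b)
        total += math.log(p)
    return total


def confusion_em_step(bins, observed, lam, topic_word, p_c):
    """One update of the topic weights following (32), with e^{mu_t} = lambda_t."""
    T = len(lam)
    num = [0.0] * T  # first factor of (32)
    den = [0.0] * T  # second factor of (32), before inversion
    for b, obs in zip(bins, observed):
        mix = {w: q_mix(w, lam, topic_word) for w in b}
        # Posterior r_i(w | obs); the bin normalizer of q cancels here.
        joint = {w: mix[w] * p_c[w].get(obs, 0.0) for w in b}
        z = sum(joint.values())
        if z == 0.0:
            continue  # observed word unreachable from this bin (assumption: skip)
        bin_mass = [sum(q_topic(w, t, topic_word) for w in b) for t in range(T)]
        bin_total = sum(lam[t] * bin_mass[t] for t in range(T))
        for t in range(T):
            den[t] += bin_mass[t] / bin_total
            for w in b:
                num[t] += (joint[w] / z) * lam[t] * q_topic(w, t, topic_word) / mix[w]
    new = [n / d for n, d in zip(num, den)]
    total = sum(new)
    return [x / total for x in new]  # map e^{mu + delta} back to weights
```

Calling `confusion_em_step` repeatedly, feeding each result back in as `lam`, should not decrease `log_likelihood`, since each step maximizes a lower bound on the Q-difference.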

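4.2 Learning $\lambda$ from Expected Counts

As before, we now consider maximizing the expected log-likelihood of the ASR output:

$$\mathcal{L}''^{(2)} = E\Big[\sum_{i=1}^{M} \log p_i(W_i)\Big] \tag{33}$$

where $W_i$ takes the value $w \in b_i$ with probability equal to the confidence (posterior probability) $s_i(w)$ of the recognizer. The modified maximum-likelihood estimation problem now becomes:

$$\lambda^* = \arg\max_{\lambda} \mathcal{L}''^{(2)} = \arg\max_{\lambda} \sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \log p_i(w') = \arg\max_{\lambda} \sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \times \log\left(\sum_{w \in b_i} \frac{\sum_{t} \lambda_t \, q(w|t) \, p_c(w'|w)}{\sum_{w'' \in b_i} \sum_{t'} \lambda_{t'} \, q(w''|t')}\right) \tag{34}$$

By following a procedure similar to the one described earlier in this section, we come up with the lower bound on the Q-difference:

$$Q(q^{(j)}, \mu^{(j)} + \delta) - Q(q^{(j)}, \mu^{(j)}) \geq \sum_{i=1}^{M} \sum_{w, w' \in b_i} r_i^{(j)}(w|w') \, s_i(w') \left[1 + \frac{\sum_{t} e^{\mu_t^{(j)}} q(w|t) \, \delta_t}{\sum_{t} e^{\mu_t^{(j)}} q(w|t)} - \frac{\sum_{t} e^{\mu_t^{(j)} + \delta_t} \sum_{w'' \in b_i} q(w''|t)}{\sum_{t} e^{\mu_t^{(j)}} \sum_{w'' \in b_i} q(w''|t)}\right] \tag{35}$$

whose optimization results in the following update equation for $\delta_t$:

$$e^{\mu_t^{(j)} + \delta_t} = \left(\sum_{i=1}^{M} \sum_{w, w' \in b_i} \frac{r_i^{(j)}(w|w') \, s_i(w') \, e^{\mu_t^{(j)}} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} q(w|t')}\right) \times \left(\sum_{i=1}^{M} \frac{\sum_{w \in b_i} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} \sum_{w \in b_i} q(w|t')}\right)^{-1} \tag{36}$$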



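4.3 Maximum-Aposteriori Estimation of $\lambda$

Finally, we consider a MAP estimation of $\lambda$ with the confusion model. The optimization contains the term $\beta \log(\mathrm{Dir}(\lambda; \alpha))$ and the Q-difference (20) is:

$$Q(q^{(j)}, \mu^{(j)} + \delta) - Q(q^{(j)}, \mu^{(j)}) + \beta \log\big(\mathrm{Dir}(e^{\mu^{(j)} + \delta} / Z_1)\big) - \beta \log\big(\mathrm{Dir}(e^{\mu^{(j)}} / Z_0)\big) \tag{37}$$

where $Z_0$, $Z_1$ are the corresponding normalizing constants. In order to obtain a lower bound on the difference in the second line of (37), we consider the following chain of equalities:

$$\log\big(\mathrm{Dir}(e^{\mu^{(j)} + \delta} / Z_1)\big) - \log\big(\mathrm{Dir}(e^{\mu^{(j)}} / Z_0)\big) = (\alpha - 1) \sum_{t} \log\left(\frac{e^{\mu_t^{(j)} + \delta_t}}{e^{\mu_t^{(j)}}} \cdot \frac{\sum_{t'} e^{\mu_{t'}^{(j)}}}{\sum_{t'} e^{\mu_{t'}^{(j)} + \delta_{t'}}}\right) = (\alpha - 1) \left[\sum_{t} \delta_t - T \log \sum_{t'} \lambda_{t'}^{(j)} e^{\delta_{t'}}\right] \tag{38}$$

We distinguish two cases:

(i) Case 1: $\alpha < 1$.

We apply Jensen's inequality to obtain the following lower bound on (38),

$$(\alpha - 1) \left[\sum_{t} \delta_t - T \sum_{t} \lambda_t^{(j)} \delta_t\right] \tag{39}$$

The lower bound (39) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to $\delta_t$ and setting the result equal to zero we obtain the update equation:

$$e^{\mu_t^{(j)} + \delta_t} = \big(A + \beta(\alpha - 1)(1 - T\lambda_t^{(j)})\big) \times \left(\sum_{i=1}^{M} \frac{\sum_{w \in b_i} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} \sum_{w \in b_i} q(w|t')}\right)^{-1} \tag{40}$$

where A in (40) is equal to

$$\sum_{i=1}^{M} \sum_{w \in b_i} \frac{r_i^{(j)}(w|\hat{w}_i) \, e^{\mu_t^{(j)}} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} q(w|t')} \tag{41}$$

in the case of using just the 1-best, and

$$\sum_{i=1}^{M} \sum_{w, w' \in b_i} \frac{r_i^{(j)}(w|w') \, s_i(w') \, e^{\mu_t^{(j)}} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} q(w|t')} \tag{42}$$

in the case of using the lattice.

(ii) Case 2: $\alpha > 1$.

We apply the well-known inequality $\log(x) \leq x - 1$ and obtain the following lower bound on (38),

$$(\alpha - 1) \left[\sum_{t} \delta_t + T \Big(1 - \sum_{t} \lambda_t^{(j)} e^{\delta_t}\Big)\right] \tag{43}$$

As before, the lower bound (43) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to $\delta_t$ and setting the result equal to zero we obtain the update equation:

$$e^{\mu_t^{(j)} + \delta_t} = \big(A + \beta(\alpha - 1)\big) \times \left(\sum_{i=1}^{M} \frac{\sum_{w \in b_i} q(w|t)}{\sum_{t'} e^{\mu_{t'}^{(j)}} \sum_{w \in b_i} q(w|t')} + \frac{\beta(\alpha - 1) T}{\sum_{t'} e^{\mu_{t'}^{(j)}}}\right)^{-1} \tag{44}$$

where A in (44) is equal to (41) or (42), depending on whether the log-likelihood of the 1-best or the expected log-likelihood is maximized.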




Figure 2: Perplexities of each language model. "Content-word" perplexity is computed over those words whose 
reference transcript counts do not exceed the threshold thr (shown at the top of each plot). We consider MLE 
and MAP solutions based on 1-best and expected counts (tf) for both ASR output self-training (solid bars) and our 
confusion model (dashed bars). Overall perplexity (which is dominated by function words) is not affected significantly. 



Summary: We summarize our methods as graphical models in Figure 1. We estimate the topic distributions $\lambda$ for a conversation C by either self-training directly on the ASR output, or including our ASR channel confusion model. For training on the ASR output, we rely on MLE 1-best training ((5), (6), left figure), or expected counts from the lattices (10). For both settings we also consider MAP estimation: 1-best (12) and expected counts (13). When using the ASR channel confusion model, we derived parallel cases for MLE 1-best ((32), middle figure) and expected counts (36), as well as the MAP (right figure) training of each ((41), (42)).



5 Experimental Results 

We compare self-training, as described in Section 3, with our confusion approach, as described in Section 4, on topic-based language model adaptation.

Setup Speech data is taken from the Fisher telephone conversation speech corpus, which has been split into 4 parts: set $A_1$ is one hour of speech used for training the ASR system (acoustic modeling). We chose a small training set to simulate a high WER condition (approx. 56%) since we are primarily interested in low resource settings. While conversational speech is a challenging task even with many hours of training data, we are interested in settings with a tiny amount of training data, such as for new domains, languages or noise conditions. Set $A_2$, a superset of $A_1$, contains 5.5 million words of manual transcripts, used for training the topic-based distribution $q(\cdot|t)$. Conversations in the Fisher corpus are labeled with 40 topics so we create 40 topic-based unigram distributions. These are smoothed based on the vocabulary of the recognizer using the Witten-Bell algorithm (Chen and Goodman, 1996). Set B, which consists of 50 hours and is disjoint from the other sets, is used as a development corpus for tuning the MAP parameter $\beta(\alpha - 1)$. Finally, set C (44 hours) is used as a blind test set. The ASR channel and the topic proportions $\lambda$ are learned in an unsupervised manner on both sets B and C. The results are reported on approximately 5-hour subsets of sets B and C, consisting of 35 conversations each.



BBN's ASR system, Byblos, was used in all ASR experiments. It is a multi-pass LVCSR system that uses state-clustered Gaussian tied-mixture models at the triphone and quinphone levels (Prasad et al., 2005). The audio features are transformed using cepstral normalization, HLDA and VTLN. Only ML estimation was used. Decoding performs three passes: a forward and backward pass with a triphone acoustic model (AM) and a 3-gram language model (LM), and rescoring using quinphone AM and a 4-gram LM. These three steps are repeated after speaker adaptation using CMLLR. The vocabulary of the recognizer is 75k words. References of the dev and test sets have vocabularies of 13k and 11k respectively.

Content Words Our focus on estimating confu- 
sions suggests that improvements would be mani- 
fest for content words, as opposed to frequently oc- 
curring function words. This would be a highly 
desirable improvement as more accurate content 
words lead to improved readability or performance 
on downstream tasks, such as information retrieval 
and spoken term detection. As a result, we care more 
about reducing perplexity on these content words 
than reducing overall scores, which give too much 
emphasis to function words, the most frequent to- 
kens in the reference transcriptions. 

To measure content word (low-frequency words) improvements we use the method of Wu and Khudanpur (2000), who compute a constrained version of perplexity focused on content words. We restrict the computation to only those words whose counts in the reference transcripts are at most equal to a threshold thr. Perplexity (Jelinek, 1997) is measured on the manual transcripts of both dev and test data based on the formula $PPL(p) = 10^{-\frac{1}{|C|} \sum_{i=1}^{|C|} \log_{10}(p(w_i))}$, where $p$ represents the estimated language model, and $|C|$ is the size of the (dev or test) corpus. We emphasize that constrained perplexity is not an evaluation metric, and directly optimizing it would foolishly hurt overall perplexity. However, if overall perplexity remains unchanged, then improvements in content word perplexity reflect a shift of the probability mass, emphasizing corrections in content words over the accuracy of function words, a sensible choice for improving output quality.
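The constrained perplexity can be sketched in Python as follows. One detail is an assumption of this sketch: the exponent is normalized by the number of tokens that pass the threshold, since the text does not spell out the normalizer for the constrained case. The function name `constrained_perplexity` and the toy model in the usage lines are hypothetical.

```python
import math
from collections import Counter


def constrained_perplexity(reference, prob, thr=None):
    """Perplexity over reference tokens whose reference count is at most thr.

    reference: list of tokens from the manual transcript.
    prob: function mapping a word to its model probability p(w) > 0.
    thr: count threshold; None scores every token (overall perplexity).
    """
    counts = Counter(reference)
    kept = [w for w in reference if thr is None or counts[w] <= thr]
    log_sum = sum(math.log10(prob(w)) for w in kept)
    # Assumption: normalize by the number of kept tokens.
    return 10 ** (-log_sum / len(kept))


# Toy usage with a hypothetical unigram model.
model = {"the": 0.5, "cat": 0.125}
tokens = ["the", "the", "the", "cat"]
overall = constrained_perplexity(tokens, model.get)
content = constrained_perplexity(tokens, model.get, thr=1)
```

In this toy case the frequent word "the" dominates the overall score, while the thresholded version scores only the rare word "cat", which mirrors why overall perplexity is insensitive to content-word improvements.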



Results First we observe that overall perplexity (far right of Figure 2) remains unchanged; none of the differences between models with and without confusion estimates are statistically significant. However, as expected, the confusion estimates significantly improve the performance on content words (left and center of Figure 2). The confusion model gives modest (4-6% relative) but statistically significant ($p < 2 \cdot 10^{-2}$) gains in all conditions for content (low-frequency) words. Additionally, the MAP variant (which was tuned based on low-frequency words) gives gains over the MLE version in all conditions, for both the self-supervised and the confusion model cases. This indicates that modeling confusions focuses improvements on content words, which improve readability and downstream applications.

ASR Improvements Finally, we consider how our 
adapted language models can improve WER of the 
ASR system. We used each of the language mod- 
els with the recognizer to produce transcripts of 
the dev and test sets. Overall, the best language 
model (including the confusion model) yielded no 
change in the overall WER (as we observed with 
perplexity). However, in a rescoring experiment, 
the adapted model with confusion estimates resulted 
in a 0.3% improvement in content WER (errors restricted to words that appear at most 3 times) over the unadapted model, and a 0.1% improvement over the regular adapted model. This confirms that our
improved language models yield better recognition 
output, focused on improvements in content words. 

6 Conclusion 

We have presented a new model that captures the 
confusions (errors) of the ASR channel. When 
incorporated with adaptation of a topic-based lan- 
guage model, we observe improvements in model- 
ing of content words that improve readability and 
downstream applications. Our improvements are 
consistent across a number of settings, including 
1-best and lattice self-training on conversational 
speech. Beyond improvements to language mod- 
eling, we believe that our confusion model can aid 
other speech tasks, such as topic classification. We 
plan to investigate other tasks, as well as better con- 
fusion models, in future work. 